Hello data science world! You will find my repository here. I am new to all that, but I have been learning fast, so let’s see how far I can manage with the many bugs here.
During the first week we went through the general instructions of the course, deadlines, tools to learn etc.- and start working! We get familiar with R, RStudio and GitHub. We create our own GitHub repository: a place on the web to store your course codes and analyses.
This week we are learning how to manage with large datatables and create working size tables; a process called data wrangling. We then focus on data analysis, single and multiple regression analysis and model validation.
I will be mostly working with data produced by Kimmo Vehkalahti as part of his work for the study of:
“The relationship between learning approaches and students’ achievements in an introductory statistics course in Finland”
I am using a shorter version of the dataset after shortening the table to a more manageable size. Here is the structure of the table:
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
Here, you can see the same data in a table format for your interest.
Here is a summary table of the above data:
## gender Age Attitude deep stra
## F:110 Min. :17.00 Min. :14.00 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :32.00 Median :3.667 Median :3.188
## Mean :25.51 Mean :31.43 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :50.00 Max. :4.917 Max. :5.000
## surf Points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
But for better visualisation and understanding of the relationship from the available variables, here is a paired plot:
Let’s see what is the most significant factor affecting the final exam points of a student.
First, how is the exam points scored affected by the student’s attitude, age and strategic skills (based on their average response to strategic questions).
##
## Call:
## lm(formula = Points ~ Attitude + Age + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1149 -3.2003 0.3303 3.4129 10.7599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.89543 2.64834 4.114 6.17e-05 ***
## Attitude 0.34808 0.05622 6.191 4.72e-09 ***
## Age -0.08822 0.05302 -1.664 0.0981 .
## stra 1.00371 0.53434 1.878 0.0621 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared: 0.2182, Adjusted R-squared: 0.2037
## F-statistic: 15.07 on 3 and 162 DF, p-value: 1.07e-08
It looks like the most important factor affecting the student’s final exam result is purely their attitude (p-value > 0.001)! This is a multivariate regression model. Score on strategic and surface questions show no significant relationship with exam points scorred. The R-squared value of 0.22 implies that the model can explain 22 % or about one-fifth of the variation in the outcome.
Here, is the summary of the linear model with the two parameters alone:
##
## Call:
## lm(formula = Points ~ Attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
This univariate model explains that the variable of exam points scored is significantly related to the students attitude (p-value > 0). At a theoretical attitude value of zero the points scored would be 11.64 with a standard error 1.83. For every point of attitude increased, there is 0.35 more exam points scored with a standard error of 0.06.
R-squared is always between 0 and 1 (and as percentage when multiplied by 100): 0% indicates that the model explains none of the variability of the response data around its mean. 100% indicates that the model explains all the variability of the response data around its mean. Multiple R-squared is used for evaluating how well the model fits the data. In this case, Multiple R-squared value of 0.19 implies that the model can explain only 19% of the variation in the outcome.
So, this is how the model actually looks like:
R makes it easy to graphically explore the validity of your model assumptions. If you give a linear model object as the first argument to the plot() function, the function automatically assumes you want diagnostic plots and will produce them. You can check the help page of plotting an lm object by typing ?plot.lm or help(plot.lm) to the R console.
The Residuals vs Fitted plot studies the variance of the residuals. This plot represents the deviation of the observed values of an element of a statistical sample from its “theoretical value”. The residual of an observed value is the difference between the observed value and the estimated value of the quantity of interest. Here, the linearity of the regression model is supported by a homogeneity of variance (or homoscedasticity) with residuals that appear close to 10 points difference from the fitted values. This is off course with the exception of the observations 145, 56 and 35. * In other words: + We are looking at how well the model describes the variables we are interested in + The target variable is modelled as a linear combination of the model parameters + Errors are normally distributed and have constant variance
A Q-Q (quantile-quantile) plot is a probability plot for comparing two probability distributions by plotting their quantiles against each other. Here, the two distributions being compared are the fitted model vs the observed values. Typically the standard deviation of residuals in a sample vary greatly from one datapoint to another even when the errors all have the same standard deviation, thus it does not make sense to compare residuals at different data points without first standartizing. Here, we see that the observations 145, 56 and 35 are again not good representations of the fitted model. Our model here is normally distrubuted and the non fitting observations appear at the extremes of the distribution and are expected to have very little impact on the model (this is studied as part of the next graph).
Finally, we plotted the standardised residuals against the leverage of the fitted model. Leverage is a measure of how far away the independent variable values of an observation are from those of the other observations. Plotting standardised residuals against leverage shows the impact of every single observation on the fitted model. There are some results that have a high impact on the model which suggests that the results are driven by a few data points; that’s what this plot is intended to help us determine. Interestingly, here the most influencial points are again 56 and 35 but 145 does not have very high leverage, but now, 71, a new observation appeared.
Note: A very good explanation came from this link.
This week, we are working with a new dataset. This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks.
high and low alcohol consumption should be mentiond…
You can find out more about this dataset and the related study here.
But here is the variable names included in the following analysis:
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
And here is how the data has been structured:
## 'data.frame': 382 obs. of 35 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 2 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : int 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : int 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
Using the given data, we are going to study student’s behavior with regards to alcohol consumption. According to stereotypes, students who are being raised by one of the two parents alone (Pstatus = A, “apart”) are more likely to seek alcohol to soothen emotional distress. When those students are members of big families (famsize = GT3, “greater than 3”) the situation could lead to higher alcohol consumption. These students are also more likely to miss class (absences are numeric from 0 to 93). While those students who are busy with extracurricular activities (activities = yes) could be distracted, get more supervision and support which leads to less alcohol consumption and less abcences recorded from class. Let’s study now if that is true based on the given data sample.
famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
activities - extra-curricular activities (binary: yes or no)
absences - number of school absences (numeric: from 0 to 93)
Numerically and graphically explore the distributions of your chosen variables and their relationships with alcohol consumption (use for example cross-tabulations, bar plots and box plots). Comment on your findings and compare the results of your exploration to your previously stated hypotheses. (0-5 points)
Comments missing…
## Warning: package 'dplyr' was built under R version 3.4.2
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 382
##
##
## | Family size
## High alcohol use | GT3 | LE3 | Row Total |
## -----------------|-----------|-----------|-----------|
## FALSE | 201 | 67 | 268 |
## | 0.182 | 0.487 | |
## | 0.750 | 0.250 | 0.702 |
## | 0.723 | 0.644 | |
## | 0.526 | 0.175 | |
## -----------------|-----------|-----------|-----------|
## TRUE | 77 | 37 | 114 |
## | 0.429 | 1.146 | |
## | 0.675 | 0.325 | 0.298 |
## | 0.277 | 0.356 | |
## | 0.202 | 0.097 | |
## -----------------|-----------|-----------|-----------|
## Column Total | 278 | 104 | 382 |
## | 0.728 | 0.272 | |
## -----------------|-----------|-----------|-----------|
##
##
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 382
##
##
## | Parental status
## High alcohol use | A | T | Row Total |
## -----------------|-----------|-----------|-----------|
## FALSE | 26 | 242 | 268 |
## | 0.016 | 0.002 | |
## | 0.097 | 0.903 | 0.702 |
## | 0.684 | 0.703 | |
## | 0.068 | 0.634 | |
## -----------------|-----------|-----------|-----------|
## TRUE | 12 | 102 | 114 |
## | 0.038 | 0.004 | |
## | 0.105 | 0.895 | 0.298 |
## | 0.316 | 0.297 | |
## | 0.031 | 0.267 | |
## -----------------|-----------|-----------|-----------|
## Column Total | 38 | 344 | 382 |
## | 0.099 | 0.901 | |
## -----------------|-----------|-----------|-----------|
##
##
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 382
##
##
## | Extra-curricular activities
## High alcohol use | no | yes | Row Total |
## -----------------|-----------|-----------|-----------|
## FALSE | 122 | 146 | 268 |
## | 0.196 | 0.176 | |
## | 0.455 | 0.545 | 0.702 |
## | 0.674 | 0.726 | |
## | 0.319 | 0.382 | |
## -----------------|-----------|-----------|-----------|
## TRUE | 59 | 55 | 114 |
## | 0.460 | 0.414 | |
## | 0.518 | 0.482 | 0.298 |
## | 0.326 | 0.274 | |
## | 0.154 | 0.144 | |
## -----------------|-----------|-----------|-----------|
## Column Total | 181 | 201 | 382 |
## | 0.474 | 0.526 | |
## -----------------|-----------|-----------|-----------|
##
##
Here is a summary of the fitted model testing the above hypothesis. This summary includes odds ratios and confidence intervals:
##
## Call:
## glm(formula = high_use ~ (alc$famsize == "GT3") + (activities ==
## "no") + (Pstatus == "A") + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3113 -0.8113 -0.7152 1.2350 1.8822
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.12493 0.26973 -4.171 3.04e-05 ***
## alc$famsize == "GT3"TRUE -0.36857 0.25541 -1.443 0.149
## activities == "no"TRUE 0.26059 0.23174 1.124 0.261
## Pstatus == "A"TRUE -0.27378 0.40287 -0.680 0.497
## absences 0.09125 0.02324 3.926 8.62e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 443.94 on 377 degrees of freedom
## AIC: 453.94
##
## Number of Fisher Scoring iterations: 4
## (Intercept) alc$famsize == "GT3"TRUE activities == "no"TRUE
## -1.12493318 -0.36856686 0.26059001
## Pstatus == "A"TRUE absences
## -0.27378186 0.09124544
## OR 2.5 % 97.5 %
## (Intercept) 0.3246742 0.1885776 0.5444944
## alc$famsize == "GT3"TRUE 0.6917250 0.4204341 1.1469385
## activities == "no"TRUE 1.2976955 0.8240954 2.0472447
## Pstatus == "A"TRUE 0.7604980 0.3320593 1.6307745
## absences 1.0955379 1.0488999 1.1490038
Interpret the results and compare them to your previously stated hypothesis. According to the above values, my hypothesis has not been proven to be correct. A large family size, parents living apart and lack of extracurricular activities have very little correlation with the outcome. To be more precise, the odds that a student will be consuming high levels of alcohol is nearly always (OR = 1.09). At the same time, the predictive value of the rest of the variables selected has a wide spread from very unlikely (Confidence intervals = 0.33) to very likely (CI = 2.04).
Using the variables which, according to your logistic regression model, had a statistical relationship with high/low alcohol consumption, explore the predictive power of you model. Provide a 2x2 cross tabulation of predictions versus the actual values and optionally display a graphic visualizing both the actual values and the predictions. Compute the total proportion of inaccurately classified individuals (= the training error) and comment on all the results. Compare the performance of the model with performance achieved by some simple guessing strategy. (0-3 points)
Here is how my current model is performing:
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.68062827 0.02094241 0.70157068
## TRUE 0.26701571 0.03141361 0.29842932
## Sum 0.94764398 0.05235602 1.00000000
My hypothesis was pretty bad and it is proven by this data. At least my faith in humanity is now restored and some kind of easy stereotypes can be ignored (as they already were).
The variables that actually work from my model are here:
##
## Call:
## glm(formula = high_use ~ absences, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3754 -0.8181 -0.7282 1.2652 1.7475
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.28190 0.15920 -8.052 8.12e-16 ***
## absences 0.08982 0.02299 3.907 9.36e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 447.45 on 380 degrees of freedom
## AIC: 451.45
##
## Number of Fisher Scoring iterations: 4
## (Intercept) absences
## -1.28189927 0.08981766
## OR2 2.5 % 97.5 %
## (Intercept) 0.2775097 0.2012914 0.3760343
## absences 1.0939748 1.0479459 1.1468072
I know, it is only one variable…but it seems that students who missed class regularly are more likely to be at risk of being high alcohol consumers.
Let’s see how these two models compare with a likelihood ratio test:
## Analysis of Deviance Table
##
## Model 1: high_use ~ (alc$famsize == "GT3") + (activities == "no") + (Pstatus ==
## "A") + absences
## Model 2: high_use ~ absences
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 377 443.94
## 2 380 447.45 -3 -3.502 0.3205
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.68848168 0.01308901 0.70157068
## TRUE 0.27486911 0.02356021 0.29842932
## Sum 0.96335079 0.03664921 1.00000000
This week we are using the Boston dataset as provided by the package MASS. Harrison and Rubinfeld 1978 published this data for the first time but you can find more information, on a nice layout and simplyfied form, here.
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
## [1] 506 14
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
From the data structure we can already see that most variables are numeric, the only variable that is integral is called chas and is a Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). The following plots aim to help with the investigation of the data relationships.
## Warning: package 'corrplot' was built under R version 3.4.2
## corrplot 0.84 loaded
So, let’s standardise the dataset and see how it looks. Data have been standardised with this formula: \[scaled(x)= \frac{x−mean(x)}{sd(x)} \]
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
To make it easier to work with, I am going to transform the crime variable into quantile bins and basically make it a categorical variable.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
## crime
## low med_low med_high high
## 127 126 126 127
## 'data.frame': 506 obs. of 14 variables:
## $ zn : num 0.285 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num -1.287 -0.593 -0.593 -1.306 -1.306 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.144 -0.74 -0.74 -0.834 -0.834 ...
## $ rm : num 0.413 0.194 1.281 1.015 1.227 ...
## $ age : num -0.12 0.367 -0.266 -0.809 -0.511 ...
## $ dis : num 0.14 0.557 0.557 1.077 1.077 ...
## $ rad : num -0.982 -0.867 -0.867 -0.752 -0.752 ...
## $ tax : num -0.666 -0.986 -0.986 -1.105 -1.105 ...
## $ ptratio: num -1.458 -0.303 -0.303 0.113 0.113 ...
## $ black : num 0.441 0.441 0.396 0.416 0.441 ...
## $ lstat : num -1.074 -0.492 -1.208 -1.36 -1.025 ...
## $ medv : num 0.16 -0.101 1.323 1.182 1.486 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 1 1 1 1 1 1 2 2 2 2 ...
Then, I am going to divide the dataset into two random groups, the train set and the test set. The train set consists of 80 % of the observations.
Linear Discriminant analysis is a classification (and dimension reduction) method. It finds the (linear) combination of the variables that separate the target variable classes. The target can be binary or multiclass variable.
Linear discriminant analysis is closely related to many other methods, such as principal component analysis (we will look into that next week) and the already familiar logistic regression.
LDA can be visualized with a biplot. The LDA biplot arrow function used in the exercise is (with slight changes) taken from this Stack Overflow message thread.
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2475248 0.2599010 0.2475248 0.2450495
##
## Group means:
## zn indus chas nox rm
## low 1.0035978 -0.9680958 -0.1148451 -0.9042817 0.451135609
## med_low -0.1111464 -0.2393904 -0.1223443 -0.5667784 -0.146847035
## med_high -0.4092038 0.2401365 0.2001230 0.3974445 0.005615594
## high -0.4872402 1.0171737 -0.1132543 1.0911183 -0.493433724
## age dis rad tax ptratio
## low -0.8984698 0.9302293 -0.6775274 -0.7172732 -0.47831855
## med_low -0.3436034 0.3367864 -0.5476413 -0.4490689 -0.03224992
## med_high 0.3958987 -0.3779279 -0.4110831 -0.2752934 -0.23027535
## high 0.8630468 -0.8748867 1.6375616 1.5136504 0.78011702
## black lstat medv
## low 0.37708439 -0.76173034 0.53399349
## med_low 0.34859173 -0.10133433 -0.02086026
## med_high 0.07875764 0.03049081 0.11549219
## high -0.69871933 0.99044033 -0.73634050
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.118984613 0.721320868 -0.805895774
## indus -0.024268549 -0.332094543 0.476827253
## chas -0.007723292 -0.041780524 -0.059544977
## nox 0.403966820 -0.612733653 -1.485626345
## rm 0.006985266 0.009994678 -0.094195185
## age 0.276535719 -0.243972787 -0.259278622
## dis -0.126937979 -0.144040056 -0.007749645
## rad 3.125463265 0.933367654 0.067733879
## tax 0.021995574 0.033025691 0.401302437
## ptratio 0.112030665 -0.016774734 -0.263919571
## black -0.123930425 0.017650941 0.162839109
## lstat 0.213701427 -0.152082442 0.391487987
## medv 0.095947383 -0.373139891 -0.251273242
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9456 0.0402 0.0142
Expectedly the prior probability of groups results to about 1 fourth in each case, since the classes created were based the data divided in quantiles. It seems that LD1 explains 94.5 % of the group variance, I will therefore use `dimen = 2´.
Now, I can better understand the coefficients of linear discriminants based on the arrows on vusualised on the plot. It seems that the variable rad (index of accessibility to radial highways) has the biggest impact on crime (Coefficient of linear discriminants = 3.15 for LD1). Next, comes the variable zn (proportion of residential land zoned for lots over 25,000 sq.ft.) and nox (nitrogen oxides concentration (parts per 10 million)) that separate low from medium crime on a gradient.
Let’s see how is the model performing now. I am going to use the test dataset, the 20 % of the observations that were separated earlier from the main table. Is my model going to predict everything correctly?
## predicted
## correct low med_low med_high high
## low 13 12 2 0
## med_low 4 10 7 0
## med_high 1 6 18 1
## high 0 0 0 28
K-means 10 clusters works best here.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influencial linear separators for the clusters? (0-2 points to compensate any loss of points from the above exercises)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
# k-means clustering
km <-kmeans(boston_scaled2, centers = 2)
lda.fit2 <- lda(crim ~., data = as.data.frame(boston_scaled2) )## Warning in lda.default(x, grouping, ...): variables are collinear
# lda.fit2
# visualize the results
plot(lda.fit2, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit2, myscale = 2)Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)## [1] 404 13
dim(lda.fit$scaling)## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)Next, install and access the plotly package. Create a 3D plot (Cool!) of the columns of the matrix product by typing the code below.
library(plotly)##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')Adjust the code: add argument color as a argument in the plot_ly() function. Set the color to be the crime classes of the train set. Draw another 3D plot where the color is defined by the clusters of the k-means. How do the plots differ? Are there any similarities? (0-3 points to compensate any loss of points from the above exercises)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime )plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = classes
)